Data Visualization
Datasets
World Happiness Report 2018 Dataset
| Column Name | Explanation |
|---|---|
| Rank | Overall happiness ranking |
| Country | Country name |
| Score | Happiness score |
| GDP_Per_Capita | Economic contribution to happiness score |
| Social_Support | Social contribution to happiness score |
| Healthy_Life_Expectancy | Health contribution to happiness score |
| Freedom_To_Make_Life_Choices | Freedom contribution to happiness score |
| Generosity | Generosity contribution to happiness score |
| Perceptions_Of_Corruption | Trustworthiness contribution to happiness score |
| Residual | Portion of happiness score that is not attributed to any of the listed categories |
vec.len controls how many of the first few elements of each vector are displayed. You can leave it at its default value. Here it is set to 1 to keep the output compact.
happy.df <- read.csv("../data/WorldHappiness2018_Data.csv")
str(happy.df, vec.len=1)
## 'data.frame': 156 obs. of 10 variables:
## $ Rank : int 1 2 ...
## $ Country : Factor w/ 156 levels "Afghanistan",..: 45 106 ...
## $ Score : num 7.63 ...
## $ GDP_Per_Capita : num 1.3 ...
## $ Social_Support : num 1.59 ...
## $ Healthy_Life_Expectancy : num 0.874 0.861 ...
## $ Freedom_To_Make_Life_Choices: num 0.681 0.686 ...
## $ Generosity : num 0.192 0.286 ...
## $ Perceptions_Of_Corruption : Factor w/ 111 levels "0.000","0.001",..: 107 103 ...
## $ Residual : Factor w/ 145 levels "0.383","0.675",..: 141 131 ...
Wages and Education of Young Males Datasets
| Column Name | Explanation |
|---|---|
| nr | Identifier |
| year | Year |
| school | Years of schooling |
| exper | Years of experience (\(= \text{age} - 6 - \text{school}\)) |
| union | If wage is set by collective bargaining |
| ethn | Ethnicity |
| maried | If married |
| health | If he has health problems |
| wage | Log hourly wage |
| industry | Industry that he was in |
| occupation | Occupation |
| residence | Residence location |
str(Males, vec.len=1)
## 'data.frame': 4360 obs. of 12 variables:
## $ nr : int 13 13 ...
## $ year : int 1980 1981 ...
## $ school : int 14 14 ...
## $ exper : int 1 2 ...
## $ union : Factor w/ 2 levels "no","yes": 1 2 ...
## $ ethn : Factor w/ 3 levels "other","black",..: 1 1 ...
## $ maried : Factor w/ 2 levels "no","yes": 1 1 ...
## $ health : Factor w/ 2 levels "no","yes": 1 1 ...
## $ wage : num 1.2 ...
## $ industry : Factor w/ 12 levels "Agricultural",..: 7 8 ...
## $ occupation: Factor w/ 9 levels "Professional, Technical_and_kindred",..: 9 9 ...
## $ residence : Factor w/ 4 levels "rural_area","north_east",..: 2 2 ...
NYC Flights Data in 2013
A data frame containing all 336,776 flights that departed from New York City in 2013.
| Column Name | Explanation |
|---|---|
| year, month, day | Date of departure |
| dep_time, arr_time | Actual departure and arrival times (format HHMM or HMM), local tz. |
| sched_dep_time, sched_arr_time | Scheduled departure and arrival times (format HHMM or HMM), local tz. |
| dep_delay, arr_delay | Departure and arrival delays, in minutes. Negative times represent early departures/arrivals. |
| hour, minute | Time of scheduled departure broken into hour and minutes. |
| carrier | Two-letter carrier abbreviation. See the airlines table to get the name. |
| tailnum | Plane tail number |
| flight | Flight number |
| origin, dest | Origin and destination. See the airports table for additional metadata. |
| air_time | Amount of time spent in the air, in minutes |
| distance | Distance between airports, in miles |
| time_hour | Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data. |
str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 19 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr "UA" "UA" "AA" "B6" ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num 1400 1416 1089 1576 762 ...
## $ hour : num 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
ggplot2: Data Visualization
Visualization is the last step before exploratory data analysis (EDA). Base R offers many tools for taking a good look at data through simple plots. However, ggplot2 is much more elegant and versatile.
Graphing Template
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
All ggplot2 commands can be thought of as following this template. It starts with the ggplot function, followed by a sequence of geom functions. All functions are connected by +.
It is good practice to pass the common dataset to the ggplot function rather than to each geom function later. In geom functions, the mapping argument expects a list of aesthetic mappings to use for the plot, which is typically returned by the aes function. Generally, we do not need to care about the details behind this. It is sufficient to treat mapping = aes(<MAPPINGS>) as one complete structure.
Note: DO NOT PUT THE + SIGN AT THE BEGINNING OF A NEW LINE. The plus sign has to come at the end of a line.
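As a concrete instance of the template, here is a minimal sketch that fills in the placeholders. It uses R's built-in mtcars data purely as a stand-in so that the example runs on its own; mtcars and its columns wt and mpg are not part of the datasets described above.

```r
library(ggplot2)

# <DATA> = mtcars, <GEOM_FUNCTION> = geom_point, <MAPPINGS> = x and y columns.
# mtcars ships with R and is used here only as a stand-in dataset.
p <- ggplot(data = mtcars) +                   # the "+" ends the line
  geom_point(mapping = aes(x = wt, y = mpg))

# Inspect the data that the point layer will draw: one row per car
point_data <- layer_data(p, 1)
nrow(point_data)
```

If the + were moved to the start of the second line, R would treat the first line as a complete expression and then fail on the second one.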
One Variable
Bar Charts
geom_bar by default makes the height of the bar proportional to the number of observations in each group.
ggplot(data = Males) +
  geom_bar(mapping = aes(x = industry, fill=maried)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) # rotate x labels by 90 degrees
The count is information derived from the original data. In other words, a statistical transformation, more specifically counting (stat_count), happens inside the geom_bar function. Therefore, the following graph is identical to the graph above.
ggplot(data = Males) +
  stat_count(mapping = aes(x = industry, fill=maried)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) # rotate x labels by 90 degrees
Position Adjustment 1: dodge
By default, bars are stacked if each group specified by the x variable can be divided into subgroups by additional information that we provide, which is maried in this example. If you prefer to place overlapping objects side by side, pass position = "dodge" to geom_bar.
ggplot(data = Males) +
  geom_bar(mapping = aes(x = industry, fill=maried), position="dodge") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) # rotate x labels by 90 degrees
Position Adjustment 2: fill
If you prefer proportions to counts (in this example, if you are more concerned with the proportion of single men in each industry than with their number), try position = "fill".
ggplot(data = Males) +
  geom_bar(mapping = aes(x = industry, fill=maried), position="fill") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) # rotate x labels by 90 degrees
Density Plot and Histogram
ggplot(happy.df, mapping = aes(x = Healthy_Life_Expectancy)) +
  geom_density(kernel='gaussian') +
  geom_histogram(mapping = aes(y=after_stat(density)), bins=20, alpha=0.5)
For geom_density, the kernel argument selects the smoothing kernel. For geom_histogram, bins sets the number of bins and binwidth sets the width of each bin.
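To make the kernel, bins and binwidth arguments concrete, here is a small sketch. It uses the built-in faithful data as a stand-in (an assumption for illustration; it is not one of the datasets above). bins fixes how many bins are drawn, whereas binwidth fixes how wide each bin is and lets the number of bins follow from the range of the data.

```r
library(ggplot2)

# faithful ships with R; waiting is the time between eruptions in minutes
base <- ggplot(faithful, aes(x = waiting))

# Fix the number of bins
by_bins <- layer_data(base + geom_histogram(bins = 20))

# Fix the width of each bin instead (5 minutes here)
by_width <- layer_data(base + geom_histogram(binwidth = 5))

# A different smoothing kernel for the density curve
dens <- layer_data(base + geom_density(kernel = "epanechnikov"))

nrow(by_bins)                     # number of bins drawn
by_width$xmax - by_width$xmin     # width of every bin
```

Only one of bins and binwidth is needed. When neither is given, geom_histogram falls back to 30 bins and prints a message suggesting a better choice.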
Two Variables
Scatterplots
Scatterplots are most useful to see the relationship between two continuous variables.
Suppose one would like to find out the relationship between money and happiness based on this dataset. The first step is usually to draw a scatterplot (dot plot) to get a sense of the trend.
As introduced previously, the dataset happy.df is passed to ggplot. The function geom_point creates the scatterplot. In this example, the function aes specifies which column goes on the x-axis and which on the y-axis.
ggplot(data = happy.df) +
  geom_point(mapping = aes(x = GDP_Per_Capita, y = Score))
The plot suggests that money and happiness have a fairly strong positive correlation.
Scatterplots are flexible. In addition to the location of the points, marker type (shape), marker size (size), marker color (color) and transparency (alpha) can also be used to encode information.
shape can only be used to represent discrete values, while color, alpha and size are good for both discrete and continuous values. All these attributes go inside the aes function.
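The restriction on shape can be checked with a short sketch. It uses the mpg data that ships with ggplot2 as a stand-in (an assumption for illustration): drv is a discrete column with three drive types and cty is numeric.

```r
library(ggplot2)

# A discrete variable mapped to shape: each level gets its own marker
p_discrete <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, shape = drv))
shapes_used <- unique(layer_data(p_discrete)$shape)

# A continuous variable mapped to shape is rejected when the plot is built
p_continuous <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, shape = cty))
result <- tryCatch({
  ggplot_build(p_continuous)
  "built"
}, error = function(e) "error")

length(shapes_used)   # number of distinct markers in the first plot
result                # outcome of building the second plot
```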
The following example uses transparency to represent life expectancy.
ggplot(data = happy.df) +
  geom_point(mapping = aes(x = GDP_Per_Capita, y = Score, alpha = Healthy_Life_Expectancy))
Attributes set outside of mapping do not carry information from the dataset. Their values have to be provided explicitly.
ggplot(data = happy.df) +
  geom_point(mapping = aes(x = GDP_Per_Capita, y = Score), color = "blue", shape=7)
Ideally, scatterplots are for two continuous variables. However, they can also be used to compare one continuous variable with a variable that takes only a limited number of values. As you can see, this is not ideal because many points overlap: they are concentrated on the few distinct values of experience.
ggplot(data = Males) +
  geom_point(mapping = aes(x = exper, y = wage, color=maried))
To mitigate this problem, set position='jitter' to add a small amount of random variation to the location of each point.
ggplot(data = Males) +
  geom_point(mapping = aes(x = exper, y = wage, color=maried), position='jitter')
Alternatively, use geom_jitter, which is a convenient shortcut for geom_point(position = 'jitter').
ggplot(data = Males) +
  geom_jitter(mapping = aes(x = exper, y = wage, color=maried))
Quiz: What goes wrong in this plot?
ggplot(data = happy.df) +
  geom_point(mapping = aes(x = GDP_Per_Capita, y = Score, color = "blue"))
Lines
ggplot(data = happy.df, aes(x = Score, y = GDP_Per_Capita)) +
  geom_point() +
  geom_line()
ggplot(data = happy.df, aes(x = Score, y = GDP_Per_Capita)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
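The message above shows that geom_smooth() chose a smoothing method automatically. As a hedged sketch, the method can also be set explicitly, for example a straight-line fit with method = "lm". The built-in mtcars data is used as a stand-in here (an assumption for illustration), and the comparison against lm() is only a sanity check that the drawn line is an ordinary least-squares fit.

```r
library(ggplot2)

# Scatterplot with an explicit linear smoother and no confidence band
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# The line drawn by the smooth layer (layer 2) should agree with lm()
smooth_data <- layer_data(p, 2)
fit <- lm(mpg ~ wt, data = mtcars)
max_diff <- max(abs(smooth_data$y -
                    predict(fit, data.frame(wt = smooth_data$x))))
max_diff
```

Passing method and formula explicitly also suppresses the informational message.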
Rug Plots
A rug plot is a compact visualisation designed to supplement a 2d display with the two 1d marginal distributions. Rug plots display individual cases so are best used with smaller datasets.
ggplot(data = happy.df,
       mapping = aes(x = GDP_Per_Capita, y = Score)) +
  geom_point() +
  geom_rug(sides = "bl")
Boxplots
ggplot(flights) +
  geom_boxplot(mapping = aes(x = carrier, y = air_time), na.rm = TRUE)
How to read a boxplot?
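In a boxplot drawn by geom_boxplot, the box spans the first to the third quartile, the line inside the box is the median, the whiskers extend to the most extreme observations that lie within 1.5 times the interquartile range of the box, and points beyond the whiskers are drawn individually as outliers. The sketch below pulls these numbers out of a plot. It uses the built-in mtcars data as a stand-in (an assumption for illustration) rather than flights.

```r
library(ggplot2)

# One box of mpg per number of cylinders (4, 6, 8)
p <- ggplot(mtcars) +
  geom_boxplot(aes(x = factor(cyl), y = mpg))

# ymin/ymax are the whisker ends, lower/upper the box edges, middle the median
box_stats <- layer_data(p)[, c("ymin", "lower", "middle", "upper", "ymax")]
box_stats

# Cross-check the first box against the raw data for 4-cylinder cars
four_cyl <- mtcars$mpg[mtcars$cyl == 4]
median(four_cyl)
```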
Violin Plots
ggplot(flights) +
  geom_violin(mapping = aes(x = carrier, y = air_time), na.rm = TRUE)
ggplot(flights, aes(x = carrier, y = air_time)) +
  geom_violin(na.rm = TRUE) +               # draw the violin first so it does not cover the box
  geom_boxplot(width = 0.1, na.rm = TRUE)   # narrow box inside the violin
2D Density Plots
ggplot(happy.df, aes(x=Healthy_Life_Expectancy, y=GDP_Per_Capita)) +
  geom_density2d()
Hex Plot
ggplot(happy.df, aes(x=Healthy_Life_Expectancy, y=GDP_Per_Capita)) +
  geom_hex(binwidth=c(0.2, 0.5)) # requires the hexbin package
Facets: Groups of Scatterplots
Besides using aesthetic attributes such as alpha and shape to encode additional information, one can also split a plot into facets when dealing with categorical variables. The result is a group of scatterplots, each representing one category.
Suppose we would like to find out how experience affects wage for the two ethnic minority groups across 12 industries. We can create 12 scatterplots, one for each industry, instead of plotting one comprehensive scatterplot with all information, which is most likely very messy.
To create this facet, call facet_wrap() after geom_point() or geom_jitter(). The first argument of facet_wrap() is a formula. (A formula is a data structure in R, which can be seen as an expression containing ~.) ~ industry tells R to create a facet according to the levels of industry.
Males %>%
  filter(ethn != "other" ) %>%
  ggplot() +
  geom_jitter(mapping = aes(x = exper, y = wage, color=ethn), alpha=0.5) +
  facet_wrap( ~ industry, nrow=4)
It is easy to draw some preliminary conclusions from the facets. For example, most observations are from manufacturing and trade. In the finance industry, according to the dataset, a black man usually earns more for the same number of years of experience.
You can also create a facet based on the combination of levels of multiple discrete variables. To do this, put ~ between the variable names. For example, instead of encoding ethnicity as color, the following plot is faceted on the combination of ethnicity and industry.
Males %>%
  filter(ethn != "other" ) %>%
  ggplot() +
  geom_jitter(mapping = aes(x = exper, y = wage), alpha=0.5) +
  facet_wrap(ethn ~ industry, nrow=4)
It is obvious that the variable passed to facet_wrap() should be discrete.
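The two formula forms can be tried on any data frame with discrete columns. The following sketch uses the mpg data that ships with ggplot2 as a stand-in (an assumption for illustration; class, drv and cyl are columns of mpg, not of Males) and counts the panels that each form produces.

```r
library(ggplot2)

# One panel per level of a single discrete variable
p_one <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

# One panel per observed combination of two discrete variables
p_two <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_wrap(drv ~ cyl)

# Each row of the layer data records the panel it is drawn in
panels_one <- length(unique(layer_data(p_one)$PANEL))
panels_two <- length(unique(layer_data(p_two)$PANEL))
panels_one
panels_two
```

Only combinations that actually occur in the data get a panel, so the second count can be smaller than the product of the numbers of levels.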
Coordinate System
ggplot(data = Males) +
  geom_bar(mapping = aes(x = school, fill=occupation)) +
  coord_flip()
ggplot(data = Males) +
  geom_bar(mapping = aes(x = school, fill=occupation)) +
  coord_polar()
Geometric Objects
ggplot(data = happy.df, mapping = aes(x = GDP_Per_Capita, y = Score)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = Males, mapping = aes(x = exper, y = wage, color=maried)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
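Two things are worth noting about the plots above. First, the messages differ because the default method depends on the amount of data: geom_smooth() uses loess for fewer than 1,000 observations and gam otherwise. Second, an aesthetic placed in ggplot() is inherited by every geom, so mapping color there yields one smooth curve per group. The sketch below illustrates the second point with the mpg data from ggplot2 as a stand-in (an assumption for illustration), using a linear fit to keep it simple.

```r
library(ggplot2)

# color in ggplot(): inherited by both layers, so one fitted line per drv level
p_global <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# color only in geom_point(): points are colored, but a single line is fitted
p_local <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Count the groups seen by the smooth layer (layer 2) in each plot
groups_global <- length(unique(layer_data(p_global, 2)$group))
groups_local  <- length(unique(layer_data(p_local, 2)$group))
groups_global
groups_local
```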
Map
map.world <- map_data('world')
happy.df <- happy.df %>%
  mutate(Country = as.character(Country)) %>%
  mutate(Country = if_else(Country == "United States", 'USA',
                   if_else(Country == "United Kingdom", 'UK',
                           Country)))
map.df <- left_join(map.world, happy.df, by = c('region' = 'Country'))
ggplot(data = map.df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = Score)) +
  scale_fill_viridis() +
  theme_bw() +
  labs(title = "Happiness Score by Country", subtitle = "World Happiness Report 2018")
The Google API
As of mid-2018, the Google Maps Platform requires a registered API key. Go to the API registration page, enable all map services and follow the instructions. The geocoding API is free as long as you remain in the free tier. Nevertheless, you need to associate a credit card with the account.
register_google("your.api.key")
countries_loc <- geocode(c("Hong Kong", "New York, USA", "Tokyo, Japan", "London",
                           "Singapore", "Shanghai", "Toronto", "Zurich", "Beijing",
                           "Frankfurt"))
ggplot(data = countries_loc) +
  borders("world", fill = "grey", colour = "grey") +
  geom_point(mapping = aes(x = lon, y = lat), color = "red") + # a fixed color goes outside aes()
  labs(title = "Financial Centers Distribution",
       subtitle = "According to Global Financial Centres Index (2007–ongoing)")
Summary
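| Type of Plots | Geom Functions |
|---|---|
| Bar charts | geom_bar |
| Density plots and histograms | geom_density, geom_histogram |
| Scatterplots | geom_point, geom_jitter |
| Lines | geom_line, geom_smooth |
| Rug plots | geom_rug |
| Boxplots | geom_boxplot |
| Violin plots | geom_violin |
| 2D density plots | geom_density2d |
| Hex plots | geom_hex |
| Maps | geom_polygon, borders |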
Quick Reference
Shape
ggplot2 cheatsheet https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
ggplot2 extensions https://www.ggplot2-exts.org/